Exercise 1

Install and load the janitor package. Use its clean_names() function on the Pokémon data, and save the results to work with for the rest of the assignment. What happened to the data? Why do you think clean_names() is useful?

library(tidyverse)
library(tidymodels)
library(janitor)
library(kableExtra)
library(plotly)

pokemon <- read.csv("data/Pokemon.csv") %>% clean_names()
pokemon %>% 
  head(30) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE) %>% 
  scroll_box(width = "100%", height = "200px")
| x | name | type_1 | type_2 | total | hp | attack | defense | sp_atk | sp_def | speed | generation | legendary |
|---|------|--------|--------|-------|-----|--------|---------|--------|--------|-------|------------|-----------|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire |  | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 5 | Charmeleon | Fire |  | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
| 6 | Charizard | Fire | Flying | 534 | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| 6 | CharizardMega Charizard X | Fire | Dragon | 634 | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
| 6 | CharizardMega Charizard Y | Fire | Flying | 634 | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
| 7 | Squirtle | Water |  | 314 | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
| 8 | Wartortle | Water |  | 405 | 59 | 63 | 80 | 65 | 80 | 58 | 1 | False |
| 9 | Blastoise | Water |  | 530 | 79 | 83 | 100 | 85 | 105 | 78 | 1 | False |
| 9 | BlastoiseMega Blastoise | Water |  | 630 | 79 | 103 | 120 | 135 | 115 | 78 | 1 | False |
| 10 | Caterpie | Bug |  | 195 | 45 | 30 | 35 | 20 | 20 | 45 | 1 | False |
| 11 | Metapod | Bug |  | 205 | 50 | 20 | 55 | 25 | 25 | 30 | 1 | False |
| 12 | Butterfree | Bug | Flying | 395 | 60 | 45 | 50 | 90 | 80 | 70 | 1 | False |
| 13 | Weedle | Bug | Poison | 195 | 40 | 35 | 30 | 20 | 20 | 50 | 1 | False |
| 14 | Kakuna | Bug | Poison | 205 | 45 | 25 | 50 | 25 | 25 | 35 | 1 | False |
| 15 | Beedrill | Bug | Poison | 395 | 65 | 90 | 40 | 45 | 80 | 75 | 1 | False |
| 15 | BeedrillMega Beedrill | Bug | Poison | 495 | 65 | 150 | 40 | 15 | 80 | 145 | 1 | False |
| 16 | Pidgey | Normal | Flying | 251 | 40 | 45 | 40 | 35 | 35 | 56 | 1 | False |
| 17 | Pidgeotto | Normal | Flying | 349 | 63 | 60 | 55 | 50 | 50 | 71 | 1 | False |
| 18 | Pidgeot | Normal | Flying | 479 | 83 | 80 | 75 | 70 | 70 | 101 | 1 | False |
| 18 | PidgeotMega Pidgeot | Normal | Flying | 579 | 83 | 80 | 80 | 135 | 80 | 121 | 1 | False |
| 19 | Rattata | Normal |  | 253 | 30 | 56 | 35 | 25 | 35 | 72 | 1 | False |
| 20 | Raticate | Normal |  | 413 | 55 | 81 | 60 | 50 | 70 | 97 | 1 | False |
| 21 | Spearow | Normal | Flying | 262 | 40 | 60 | 30 | 31 | 31 | 70 | 1 | False |
| 22 | Fearow | Normal | Flying | 442 | 65 | 90 | 65 | 61 | 61 | 100 | 1 | False |
| 23 | Ekans | Poison |  | 288 | 35 | 60 | 44 | 40 | 54 | 55 | 1 | False |
| 24 | Arbok | Poison |  | 438 | 60 | 85 | 69 | 65 | 79 | 80 | 1 | False |


The clean_names() function renamed the columns: it converted the names to lowercase and replaced the `.` delimiters with underscores, producing "snake_case" names (the default case setting of clean_names()). This is useful because consistent, syntactically valid column names can be referenced without backticks or quoting, which makes downstream code easier to read and less error-prone.
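As a rough illustration of what clean_names() is doing, here is a minimal base-R sketch of the snake_case rule (a simplified imitation, not janitor's actual implementation, which handles many more edge cases):

```r
# Simplified snake_case conversion: split camelCase boundaries, lowercase
# everything, then turn runs of non-alphanumeric characters (spaces,
# periods, etc.) into single underscores.
to_snake_case <- function(x) {
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # camelCase -> camel_Case
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)               # ". " and friends -> "_"
  gsub("^_|_$", "", x)                          # trim stray underscores
}

to_snake_case(c("Sp. Atk", "Type 1", "Legendary"))  # "sp_atk" "type_1" "legendary"
```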
Exercise 2

Using the entire data set, create a bar chart of the outcome variable, type_1.

How many classes of the outcome are there? Are there any Pokémon types with very few Pokémon? If so, which ones?

For this assignment, we’ll handle the rarer classes by grouping them, or “lumping them,” together into an ‘other’ category. Using the forcats package, determine how to do this, and lump all the other levels together except for the top 6 most frequent (which are Bug, Fire, Grass, Normal, Water, and Psychic).

Convert type_1 and legendary to factors.

p <- ggplot(pokemon, aes(x = type_1)) +
  geom_bar() +
  labs(x = "Primary Type") +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))
ggplotly(p)

# Convert type_1, legendary, and generation to factors
pokemon <- pokemon %>%
  mutate(type_1 = factor(type_1),
         legendary = factor(legendary),
         generation = factor(generation))

# Lump every level except the 6 most frequent into "Other"
pokemon$type_1 <- fct_collapse(pokemon$type_1,
                               Other = c("Steel", "Rock", "Poison", "Ice",
                                         "Ground", "Ghost", "Flying", "Fighting",
                                         "Fairy", "Electric", "Dragon", "Dark"))


There are initially 18 different primary types of Pokémon; the Flying and Fairy types have notably few Pokémon. While this makes sense for the Fairy type (it wasn't added until Generation VI), it's a bit odd that a type that has existed since Generation I has such a low count; this is likely because Flying usually appears as a Pokémon's secondary type rather than its primary type.
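The same lumping can be done more directly with forcats::fct_lump_n(pokemon$type_1, 6). To show the underlying idea, here is a base-R sketch on toy data (hypothetical type labels, not the real Pokémon counts) that keeps only the n most frequent levels:

```r
# Keep the n most frequent levels of a factor; collapse the rest to "Other"
lump_top_n <- function(f, n) {
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  factor(ifelse(f %in% keep, as.character(f), "Other"))
}

types <- factor(c("Water", "Water", "Water", "Fire", "Fire", "Ice"))
table(lump_top_n(types, 2))  # Fire 2, Other 1, Water 3
```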
Exercise 3

Perform an initial split of the data. Stratify by the outcome variable. You can choose a proportion to use. Verify that your training and test sets have the desired number of observations.

Next, use v-fold cross-validation on the training set. Use 5 folds. Stratify the folds by type_1 as well. Hint: Look for a strata argument.

Why do you think doing stratified sampling for cross-validation is useful?

set.seed(3435)

pokemon_split <- initial_split(pokemon, prop = 0.7, strata = type_1)
pokemon_train <- training(pokemon_split)
pokemon_test <- testing(pokemon_split)
# 5-fold cross validation
pokemon_folds  <- vfold_cv(pokemon_train, v = 5, strata = type_1)
# Verify that the training and testing sets have the correct number of observations
data.frame(Train = nrow(pokemon_train), Test = nrow(pokemon_test))
##   Train Test
## 1   559  241



Stratified sampling in cross-validation ensures that each fold has roughly the same class proportions as the full training set. Models evaluated on each fold therefore see a representative mix of classes, so the cross-validation estimates are more stable and more likely to reflect performance on unseen data. This matters especially here, where some types are much rarer than others and a purely random split could leave a fold with very few observations of a rare class.
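To see the effect concretely, fold labels can be assigned within each class rather than at random overall. A minimal base-R sketch with toy class labels (not the real vfold_cv() implementation):

```r
# Stratified fold assignment: shuffle fold labels *within* each class,
# so every fold receives (roughly) the same class mix.
set.seed(3435)
y <- rep(c("Water", "Bug", "Other"), times = c(100, 25, 125))
fold <- ave(seq_along(y), y,
            FUN = function(i) sample(rep(1:5, length.out = length(i))))

# Each fold now contains exactly 20 Water, 5 Bug, and 25 Other observations
table(y, fold)
```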


Exercise 4

Create a correlation matrix of the training set, using the corrplot package. Note: You can choose how to handle the categorical variables for this plot; justify your decision(s).

What relationships, if any, do you notice?

# One-hot encode the legendary predictor
pokemon_train$is_legendary <- as.numeric(pokemon_train$legendary) - 1
pokemon_train %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type = 'lower', diag = FALSE,
           method = 'color')

# Get rid of 'dummy' column for one-hot encoding since we only
# use that for the corrplot in this assignment
pokemon_train <- subset(pokemon_train, select=-is_legendary)

We can one-hot encode the legendary status of a Pokémon very easily because there are only two categorical classes: True and False. Beyond being easy to include, it is also worth keeping legendary in the plot (via the temporary is_legendary column holding the 0/1 encoding), since whether "legendary" Pokémon have better stats than their ordinary counterparts is a natural question.
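The `as.numeric(factor) - 1` trick used above relies on R's default alphabetical level ordering, under which "False" becomes level 1 and "True" level 2:

```r
legendary <- factor(c("False", "True", "False"))
# Default levels are alphabetical: "False" = 1, "True" = 2,
# so subtracting 1 yields a 0/1 indicator with 1 = legendary.
is_legendary <- as.numeric(legendary) - 1
is_legendary  # 0 1 0
```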

However, we choose not to include the type_1 categorical variable since we are ultimately lumping 12 potentially very different classes into a new “Other” class, essentially diluting whatever possible correlation information those classes could have contributed. While we could have potentially included type_2 (since the classes are not collapsed), it would be a bit difficult to one-hot encode 18 different classes in such a way that correlation information is readable for each individual class.

Examining the correlation matrix, we first address the expected relationships. Perhaps the most obvious is the near-perfect correlation between generation and x (the PokéDex entry number): for those who have played the games this is entirely expected, since the Pokémon introduced in each generation are appended to the end of the PokéDex, so Generation II entries begin right after the last Generation I entry. The strong correlation between total and each individual stat is also expected, since total is simply the sum of a Pokémon's stats.

Possibly the most surprising result is that there are no negative correlations between Pokémon stats. As a competitive game series, Pokémon must maintain an even playing field so that players can enjoy the game strategically. In practice this usually means that Pokémon specializing in attack (e.g., many Fire-type Pokémon) have lower defense or special defense stats; however, the correlation plot above shows no sign of such a trade-off.

Exercise 5

Set up a recipe to predict type_1 with legendary, generation, sp_atk, attack, speed, defense, hp, and sp_def.

  • Dummy-code legendary and generation;

  • Center and scale all predictors.

pokemon_recipe <- recipe(type_1 ~ legendary + generation + sp_atk + attack +
                           speed + defense + hp + sp_def,
                         data = pokemon_train) %>%
  step_dummy(legendary, generation) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

prep(pokemon_recipe) %>% bake(new_data = pokemon_train)

Exercise 6

We’ll be fitting and tuning an elastic net, tuning penalty and mixture (use multinom_reg() with the glmnet engine).

Set up this model and workflow. Create a regular grid for penalty and mixture with 10 levels each; mixture should range from 0 to 1. For this assignment, let penalty range from 0.01 to 3 (this is on the identity_trans() scale; note that you’ll need to specify these values in base 10 otherwise).
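For reference, the two hyperparameters map directly onto the elastic net penalty used by glmnet, where mixture plays the role of \(\alpha\) and penalty the role of \(\lambda\):

\[
\min_{\beta}\; -\ell(\beta) + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right),
\]

with \(-\ell(\beta)\) the multinomial negative log-likelihood. Thus mixture = 0 gives pure ridge regression and mixture = 1 gives pure lasso.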

elastic_net_model <- multinom_reg(mixture = tune(),
                              penalty = tune()) %>%
  set_engine("glmnet")

elastic_net_wflow <- workflow() %>%
  add_model(elastic_net_model) %>%
  add_recipe(pokemon_recipe)

elastic_net_grid <- grid_regular(penalty(range = c(0.01, 3), trans = identity_trans()),
                                 mixture(range = c(0, 1)),
                                 levels = 10)

Exercise 7

Now set up a random forest model and workflow. Use the ranger engine and set importance = "impurity"; we’ll be tuning mtry, trees, and min_n. Using the documentation for rand_forest(), explain in your own words what each of these hyperparameters represent.

Create a regular grid with 8 levels each. You can choose plausible ranges for each hyperparameter. Note that mtry should not be smaller than 1 or larger than 8. Explain why neither of those values would make sense.

What type of model does mtry = 8 represent?

random_forest_model <- rand_forest(mtry = tune(), 
                       trees = tune(), 
                       min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

random_forest_wflow <- workflow() %>% 
  add_model(random_forest_model) %>%
  add_recipe(pokemon_recipe) 

random_forest_grid <- grid_regular(mtry(range = c(1, 8)),
                        trees(range = c(200,700)),
                        min_n(range = c(1,16)),
                        levels = 8)

Recall that the random forest algorithm uses a collection of decision trees, each built on a bootstrap sample (bag) of the training set. Since individual decision trees are sensitive to changes in the training data, averaging (for regression) or taking the majority vote (for classification) over a large number of trees trained on slightly different bootstrapped data reduces this sensitivity.

The trees hyperparameter is the easiest to describe: it is simply the number of decision trees in the ensemble, and thus the number of predictions we aggregate to get the final result. The mtry and min_n hyperparameters control the behavior of the individual trees. mtry is the number of predictors randomly chosen to be considered at each split in a single tree, and min_n is the minimum number of observations a node must contain before it can be split; small values of min_n permit many splits, producing very deep, complex trees that can potentially overfit the data.

Since our recipe specifies that the outcome type_1 depends on the 8 predictors legendary, generation, sp_atk, attack, speed, defense, hp, and sp_def, it does not make sense to set mtry larger than 8: we cannot randomly sample more predictors than we have. Similarly, it makes no sense to randomly sample zero predictors, since without any predictors there is no way to split the predictor space and build a tree.

Lastly, a random forest with mtry = 8 considers every predictor at every split, so the trees are no longer decorrelated by random feature selection: this is simply a bagged-trees model (bagging).
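With mtry = 8 the only remaining randomness is the bootstrap resampling itself. A small base-R check (toy sample size, purely illustrative) of the familiar fact that each bootstrap sample contains roughly 63% of the distinct training observations:

```r
set.seed(3435)
n <- 800
# Draw one bootstrap sample by sampling row indices with replacement
boot_idx <- sample(n, replace = TRUE)

# Fraction of distinct observations included in the sample; in expectation
# this is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows
length(unique(boot_idx)) / n
```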

Exercise 8

Fit all models to your folded data using tune_grid().

Note: Tuning your random forest model will take a few minutes to run, anywhere from 5 minutes to 15 minutes and up. Consider running your models outside of the .Rmd, storing the results, and loading them in your .Rmd to minimize time to knit. We’ll go over how to do this in lecture.

Use autoplot() on the results. What do you notice? Do larger or smaller values of penalty and mixture produce better ROC AUC? What about values of min_n, trees, and mtry?

What elastic net model and what random forest model perform the best on your folded data? (What specific values of the hyperparameters resulted in the optimal ROC AUC?)

elastic_net_tune  <- tune_grid(
  elastic_net_wflow ,
  resamples = pokemon_folds,
  grid = elastic_net_grid
)

save(elastic_net_tune, file = "tuned_grids/elastic_net_tune.rda")

random_forest_tune <- tune_grid( 
  random_forest_wflow,
  resamples = pokemon_folds,
  grid = random_forest_grid
)

save(random_forest_tune, file = "tuned_grids/random_forest_tune.rda")
load("tuned_grids/elastic_net_tune.rda")
load("tuned_grids/random_forest_tune.rda")


g1<-autoplot(elastic_net_tune) + ggtitle("Elastic Net") + theme_dark()
ggplotly(g1)





From the above plot, we see that larger values of mixture tend to make our models more accurate when penalty is considerably small (in other words, at the lowest penalty value of 0.01, lasso tends to do better than ridge). While the accuracy plot suggests that all mixtures behave roughly the same at larger penalty values, the roc_auc plot indicates that pure ridge regression performs much better than pure lasso regression in this case.

g2 <- autoplot(random_forest_tune)
pp2 <- ggplot(g2$data, aes(x = value, y = mean, group = `# Trees`, color = `# Trees`)) +
  geom_path() +
  geom_point(size = 1) +
  facet_grid(`.metric` ~ `Minimal Node Size`, scales = "free_y",  labeller = label_both) +
  labs(x = paste(unique(g2$data$name)),
       y = "") +
  ggtitle("Random Forest") +
  theme_dark() +
  theme(text = element_text(size = 6)) 
ggplotly(pp2)

For the random forest, we notice that the roc_auc quickly improves with the number of randomly selected predictors (the mtry hyperparameter), but begins to taper off once \(\geq 5\) predictors are considered. There isn't much indication from the plots above that the minimal node size min_n or the number of trees has as much effect on model performance; however, it is worth noting that models with a larger number of trees (e.g., the trees = 700 curve) fluctuate much less than those with fewer trees.

Ultimately, our best-performing (in terms of roc_auc) elastic net model had penalty = 0.01 and mixture = 0.111, meaning it was much closer to ridge regression in terms of regularization. In addition, our best random forest model had 342 trees (trees = 342), randomly selected 3 predictors at each split (mtry = 3), and required a minimum of 5 observations to split a node (min_n = 5).

Exercise 9

Select your optimal random forest model in terms of roc_auc. Then fit that model to your training set and evaluate its performance on the testing set.

Using the training set:

  • Create a variable importance plot, using vip(). Note that you’ll still need to have set importance = "impurity" when fitting the model to your entire training set in order for this to work.

    • What variables were most useful? Which were least useful? Are these results what you expected, or not?

Using the testing set:

  • Create plots of the different ROC curves, one per level of the outcome variable;

  • Make a heat map of the confusion matrix.

random_forest_final_wflow <- finalize_workflow(random_forest_wflow,
                                               select_best(random_forest_tune, metric = "roc_auc"))

random_forest_fit  <- fit(random_forest_final_wflow , pokemon_train)

random_forest_fit %>% extract_fit_parsnip() %>% 
  vip() 

From the above variable importance plot, we see that a Pokémon's base stats are the most important factors in determining its primary type, while the least important were generation and legendary status; in fact, legendary status didn't even make it into the top 10 most important variables. This is ultimately expected, since the distribution of Pokémon types added in each generation is roughly the same.
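The "impurity" importance reported by vip() measures how much each split reduces the Gini impurity of a node. A minimal sketch of the quantity itself, using toy class proportions (not ranger's internal bookkeeping):

```r
# Gini impurity of a node, given the vector of class proportions in it
gini <- function(p) 1 - sum(p^2)

gini(c(0.5, 0.5))  # maximally impure two-class node: 0.5
gini(c(1, 0))      # pure node: 0

# A split is "good" if the weighted impurity of the child nodes is much
# lower than that of the parent; impurity importance sums these reductions
# over every split that uses a given predictor, across all trees.
```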





set.seed(3435)

# Evaluate the model fit on the TRAINING set against the held-out test set
# (refitting the workflow on the test set would leak the test data)
random_forest_test_res <- augment(random_forest_fit, new_data = pokemon_test)

random_forest_test_res %>%
  roc_curve(type_1 , c(.pred_Bug, .pred_Fire, .pred_Normal, .pred_Grass, .pred_Psychic, .pred_Water, .pred_Other) ) %>%
  autoplot()

random_forest_test_res %>% 
  conf_mat(truth = type_1, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

Exercise 10

How did your best random forest model do on the testing set?

Which Pokemon types is the model best at predicting, and which is it worst at? (Do you have any ideas why this might be?)


The confusion matrix indicates that our best random forest model predicts the testing data almost perfectly, with essentially all of the mass on the diagonal; a result this clean on held-out data is suspicious, and usually signals that the model was refit on the test set rather than fit on the training set and only evaluated on the test set. The roc_curve plots suggest that our model predicts the Grass and Bug types better than the remaining five levels of the outcome, perhaps because the base statistics of Grass- and Bug-type Pokémon have identifiable patterns. On the other hand, the model is worst at predicting the Normal and Other types; this makes sense for Other, since we combined twelve classes into one outcome level, diluting the characteristics of the original classes.

For 231 Students

Exercise 11

In the 2020-2021 season, Stephen Curry, an NBA basketball player, made 337 out of 801 three point shot attempts (42.1%). Use bootstrap resampling on a sequence of 337 1’s (makes) and 464 0’s (misses). For each bootstrap sample, compute and save the sample mean (e.g. bootstrap FG% for the player). Use 1000 bootstrap samples to plot a histogram of those values. Compute the 99% bootstrap confidence interval for Stephen Curry’s “true” end-of-season FG% using the quantile function in R. Print the endpoints of this interval.

# Double-Checks that randomness is fixed every time this is run
set.seed(3435)

# Create dataframe accurately representing that 337 of the shots were successful
# Using the sample() function we can ensure this distribution is (somewhat) random
curry_three <- data.frame(attempt = 1:801)
curry_three$made_attempt <- numeric(nrow(curry_three))
curry_three$made_attempt[sample(nrow(curry_three), 337)] <- 1

# Double check that we indeed have 337 makes and 464 misses
curry_three %>%
  group_by(made_attempt) %>%
  summarize(count=n())
## # A tibble: 2 × 2
##   made_attempt count
##          <dbl> <int>
## 1            0   464
## 2            1   337
# Bootstrap our data
curry_boots <- bootstraps(curry_three, times=1000)

# Create a function that extracts the bootstrap sample itself using
# analysis() (note: assessment() would return the out-of-bag rows,
# not the bootstrap sample) and computes the mean of made_attempt
sample_mean_boots <- function(split){
  mean(analysis(split)$made_attempt)
}
# Use mutate() + map() as explained in Lab 4
boot_res <- curry_boots %>%
  mutate(models = map(splits,sample_mean_boots)) %>%
  unnest(models)

# Plot histogram
p<-ggplot(boot_res, aes(x=models))+
  geom_histogram(color="darkblue", fill="lightblue", binwidth = 0.002) + labs(x="Sample Mean")
ggplotly(p)
# Print the 99% confidence interval for end of season FG%
quantile(boot_res$models, probs = c(0.005, 0.995))
##      0.5%     99.5% 
## 0.3605221 0.4817411
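As a sanity check on the bootstrap interval, the normal approximation for a sample proportion gives a similar 99% interval (a back-of-the-envelope comparison, not part of the assignment):

```r
# Normal-approximation 99% CI for p-hat = 337/801
p_hat <- 337 / 801
se <- sqrt(p_hat * (1 - p_hat) / 801)
p_hat + c(-1, 1) * qnorm(0.995) * se  # roughly (0.376, 0.466)
```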

Exercise 12

Using the abalone.txt data from previous assignments, fit and tune a random forest model to predict age. Use stratified cross-validation and select ranges for mtry, min_n, and trees. Present your results. What was your final chosen model’s RMSE on your testing set?

abalone <- read.csv("data/abalone.csv")
abalone <- abalone %>% mutate(age = rings + 1.5)

set.seed(3435)

abalone_split <- initial_split(abalone, prop = 0.7, strata = age)
abalone_train <- training(abalone_split)
abalone_test <- testing(abalone_split)

# 5-fold cross-validation, stratified by the outcome
abalone_folds <- vfold_cv(abalone_train, v = 5, strata = age)

abalone_recipe <- recipe(age ~ type + longest_shell + diameter +
                           height + whole_weight + shucked_weight +
                           viscera_weight + shell_weight,
                         data = abalone_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact(terms= ~ starts_with("type"):shucked_weight) %>%
  step_interact(terms= ~ longest_shell:diameter) %>%
  step_interact(terms= ~ shucked_weight:shell_weight) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) 

prep(abalone_recipe) %>% 
  bake(new_data = abalone_train) 

random_forest_model_abalone <- rand_forest(mtry = tune(), 
                       trees = tune(), 
                       min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

random_forest_wflow_abalone <- workflow() %>% 
  add_model(random_forest_model_abalone) %>%
  add_recipe(abalone_recipe) 

random_forest_grid_abalone <- grid_regular(mtry(range = c(1, 7)),
                        trees(range = c(200,700)),
                        min_n(range = c(1,16)),
                        levels = 10)


Similar to Exercise 8 above, tuning our random forest grid takes a significant amount of time to compute; thus, we store the results in an .rda file and load them in later steps.

random_forest_tune_abalone <- tune_grid(
  random_forest_wflow_abalone,
  resamples = abalone_folds,
  grid = random_forest_grid_abalone
)

save(random_forest_tune_abalone, file = "tuned_grids/random_forest_tune_abalone.rda")

Plotting the results of our grid search below, we notice that increasing the minimal node size min_n generally yields much better results in terms of both the RMSE and R^2 metrics. However, increasing the number of randomly selected predictors mtry past mtry = 3 seems to have little effect.

load("tuned_grids/random_forest_tune_abalone.rda")

g2 <- autoplot(random_forest_tune_abalone) +    theme(text = element_text(size = 3)) 
pp2 <- ggplot(g2$data, aes(x = value, y = mean, group = `Minimal Node Size`, color = `Minimal Node Size`)) +
  geom_path() +
  geom_point(size = 1) +
  facet_grid(`.metric` ~ `# Randomly Selected Predictors`, scales = "free_y",  labeller = label_both) +
  labs(x = paste(unique(g2$data$name)),
       y = "") +
  ggtitle("Random Forest") +
  theme_dark() +
  theme(text = element_text(size = 4))
ggplotly(pp2)

Ultimately, our best model (in terms of the RMSE) had 644 trees (i.e. trees = 644), randomly selected 6 predictors for each split (i.e. mtry = 6) and required a minimum of 16 observations to split a node in a decision tree (i.e. min_n = 16). One way to interpret this is that a forest made up of a large number of simpler trees performed better (on this particular regression problem) than a small number of much more complex trees.

random_forest_final_wflow_abalone <- finalize_workflow(random_forest_wflow_abalone,
                                                       select_best(random_forest_tune_abalone, metric = "rmse"))


random_forest_fit_train_abalone  <- fit(random_forest_final_wflow_abalone , abalone_train)

abalone_training_results <- predict(random_forest_fit_train_abalone, new_data = abalone_train %>% select(-age)) %>%
                          bind_cols(abalone_train %>% select(age))

# Evaluate the TRAINING-set fit on the held-out test data
# (do not refit the workflow on the test set)
abalone_testing_results <- predict(random_forest_fit_train_abalone, new_data = abalone_test %>% select(-age)) %>%
                          bind_cols(abalone_test %>% select(age))

tibble(dataset=c("RMSE", "R^2"),
       training=c((abalone_training_results  %>% rmse( age, .pred))$.estimate,
              (abalone_training_results  %>% rsq( age, .pred))$.estimate),
       testing = c((abalone_testing_results  %>% rmse( age, .pred))$.estimate, 
               (abalone_testing_results  %>% rsq( age, .pred))$.estimate)
)
## # A tibble: 2 × 3
##   dataset training testing
##   <chr>      <dbl>   <dbl>
## 1 RMSE       1.32    1.36 
## 2 R^2        0.850   0.838
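The metrics in the table reduce to simple formulas; here is a base-R sketch on toy observed/predicted vectors (hypothetical numbers, just to show the computation that yardstick's rmse() and rsq() automate):

```r
# RMSE and R^2 "by hand" on toy observed / predicted vectors
y    <- c(9.5, 11.0, 12.5, 14.0)
yhat <- c(10.0, 10.5, 13.0, 13.5)

rmse_val <- sqrt(mean((y - yhat)^2))                       # 0.5 here
# Classical R^2 = 1 - SSE/SST (note: yardstick's rsq() reports the
# squared correlation between y and yhat instead)
rsq_val <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

c(rmse = rmse_val, rsq = rsq_val)
```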

Ultimately, our final model performed quite well on the testing set with a testing RMSE of 1.36 (roughly a 3% increase from the training RMSE).